Localization, Extraction and Recognition of Text in Telugu Document Images
Abstract
In this paper we present a system to locate, extract and recognize Telugu text. The circular nature of Telugu script is exploited to segment text regions using the Hough Transform. First, the Hough Transform for circles is applied to the Sobel gradient magnitude of the image to locate text. The located circles are filled to yield text regions, and Recursive XY Cuts then segment these regions into paragraph, line and word regions. A bottom-up region merging process envelops individual words. Local binarization of the word minimum bounding rectangles (MBRs) yields connected components containing glyphs for recognition. The recognition process first identifies candidate characters by a zoning technique and then constructs structural feature vectors by cavity analysis. Finally, if required, crossing-count-based non-linear normalization and scaling are performed before template matching. The segmentation process succeeds in extracting text from images with complex non-Manhattan layouts. The recognition process achieved a character recognition accuracy of 97-98%.
Similar resources
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
A Fast Localization and Feature Extraction Method Based on Wavelet Transform in Iris Recognition
With an increasing emphasis on security, automated personal identification based on biometrics has been receiving extensive attention. Iris recognition, as an emerging biometric recognition approach, is becoming a very active topic in both research and practical applications. In general, a typical iris recognition system includes iris imaging, iris liveness detection, and recognition. This rese...
An Adaptive Character Recognizer for Telugu Scripts Using Multiresolution Analysis, Associative Memory
The present work is an attempt to develop a commercially viable and robust character recognizer for Telugu texts. We aim at designing a recognizer which exploits the inherent characteristics of the Telugu script. Our proposed method uses wavelet multiresolution analysis for extracting features and an associative memory model to accomplish the recognition tasks. Our system learns the ...
A Comparative Study on Efficiency of Classification Techniques with Zone Level Gabor Features towards Handwritten Telugu Character Recognition
Achieving high accuracies in recognition of handwritten text is a challenging research problem and never exhausting. The factors that instill challenges in handwritten character recognition include high degree of variability in writing, script type and the type of documents etc. In this paper, we focus on recognition of handwritten Telugu text commonly found in document images. The character se...
Discrimination of English to other Indian languages (Kannada and Hindi) for OCR system
India is a multilingual, multi-script country. In every state of India two languages are in use: the state's local language and English. For example, in Andhra Pradesh, a state in India, a document may contain text words in English and Telugu script. For Optical Character Recognition (OCR) of such a bilingual document, it is necessary to identify the script before feeding the text w...